Abstract

In this paper we study the problem of estimating the mixing coefficients between two random
variables. Three different mixing coefficients are studied, namely alpha-mixing, beta-mixing and phi-mixing
coefficients. The random variables can either assume values in a finite set or the set of real numbers. We 
derive upper and lower bounds for both the alpha-mixing and the phi-mixing coefficients. Moreover, in 
case the marginal distributions of the two random variables are uniform, an exact expression is given for 
the phi-mixing coefficient. This situation arises when empirically generated samples are binned using 
percentile binning. We also prove analogs of the data-processing inequality from information theory for 
each of the three kinds of mixing coefficients. Then we move on to real-valued random variables, and 
show that by using percentile binning and allowing the number of bins to increase more slowly than 
the number of samples, we can generate empirical estimates that are consistent, i.e., converge to the 
true values as the number of samples approaches infinity. 

I. Introduction 

The notion of independence of random variables is central to probability theory. In [7, p. 8], 
Kolmogorov says: 

"Indeed, as we have already seen, the theory of probability can be regarded from the 
mathematical point of view as a special application of the general theory of additive 
set functions."

and

"Historically, the independence of experiments and random variables represents the 
very mathematical concept that has given the theory of probability its peculiar stamp." 
In effect, Kolmogorov is saying that, if the notion of independence is removed, then probability 
theory reduces to just measure theory. 

Independence is a binary concept: Either two random variables are independent, or they are 
not. It is therefore worthwhile to replace the concept of independence with a more nuanced 
measure that quantifies the extent to which given random variables are dependent. In the case 
of stationary stochastic processes, there are various notions of 'mixing', corresponding to long 
term asymptotic independence. These notions can be readily adapted to define various mixing 
coefficients between two random variables. Several such definitions are presented in [4, p. 3], 
out of which three are of interest to us, namely the $\alpha$-, $\beta$- and $\phi$-mixing coefficients. While
the definitions themselves are well-known, there is very little work on actually computing (or at
least estimating) these mixing coefficients in a given situation. The $\beta$-mixing coefficient is easy
to compute, but this is not the case for the $\alpha$- and the $\phi$-mixing coefficients.

Against this background, the present paper makes the following specific contributions: 

1) For discrete random variables, simple upper and lower bounds are derived for both the $\alpha$-
and the $\phi$-mixing coefficients.

2) In the special case where the discrete random variables have uniform marginal distributions,
a closed-form formula is given for the $\phi$-mixing coefficient. This situation arises when two
real-valued random variables are sampled, and the sampled values are discretized using
percentile binning, that is, the end points of the grids are chosen such that the marginals are
(nearly) uniform. It is well-known in the statistics literature that this kind of 'data-dependent
partitioning', also referred to as 'partitioning into statistically equivalent blocks', offers better
performance than using a fixed partitioning for discretization; see the introduction of [9].

3) We study the case where X, Y, Z are discrete random variables, and X, Z are conditionally
independent given Y, or equivalently, $X \to Y \to Z$ is a short Markov chain. In this case
a well-known inequality from information theory [3, p. 34] states that

$$I(X,Z) \leq \min\{ I(X,Y), I(Y,Z) \}, \tag{1}$$

where $I(\cdot,\cdot)$ denotes the mutual information. This inequality is usually referred to as the
'data processing inequality (DPI)'. We state and prove analogs of the DPI for each of the
$\alpha$-, $\beta$- and $\phi$-mixing coefficients.

4) Suppose X, Y are real-valued random variables whose joint distribution has a density with
respect to the Lebesgue measure, and that $\{(x_1,y_1), \ldots, (x_l,y_l)\}$ are independent samples
of (X,Y). If we compute the empirical joint distribution of (X,Y) from these samples,
then the Glivenko-Cantelli Lemma states that the empirical joint distribution converges with
probability one to the true joint distribution; in other words, the empirical distribution gives
a consistent estimate. However, it is shown here that if the empirical distribution is used
to estimate the mixing coefficients, then with probability one both the estimated $\beta$-mixing
coefficient and the estimated $\phi$-mixing coefficient approach one as $l \to \infty$, irrespective of
what the true value might be. Thus a quantity derived from a consistent estimator need
not itself be consistent.

5) On the other hand, if we bin the $l$ samples into $k_l$ bins such that the quantized versions
of X and Y have nearly uniform distributions, and choose $k_l$ in such a way that $k_l \to \infty$
and $k_l/l \to 0$ as $l \to \infty$, and a few technical conditions are satisfied, then the empirically
estimated $\alpha$-, $\beta$- and $\phi$-mixing coefficients converge to their true values as $l \to \infty$, with
probability one.

The problems of efficiently computing mixing coefficients and proving analogs of the data 
processing inequality are not just of academic interest. Recent work on reverse-engineering 
genome-wide interaction networks from gene expression data is based on using the $\phi$-mixing
coefficient as a measure of the interaction between two genes; see [11]. If there are n genes in
the study, this approach requires the computation of $n^2$ $\phi$-mixing coefficients. So for a typical
genome-wide study involving 20,000 genes, it becomes necessary to compute 400 million $\phi$-mixing
coefficients. Hence efficient computation is mandatory in order to have a practically
viable implementation. The approach suggested in [13], [11] is to compute all pair-wise $\phi$-mixing
coefficients, start with a complete directed graph on n nodes, and then to use the analog
of the data processing inequality for the $\phi$-mixing coefficient to prune the network. The results
presented in this paper provide the analytical justification for the approach in [11].

II. Definitions of Mixing Coefficients

The notion of mixing originated in an attempt to establish the law of large numbers for 
stationary stochastic processes that are not i.i.d. General definitions of the $\alpha$-, $\beta$- and $\phi$-mixing
coefficients of a stationary stochastic process can be found, among other places, in [12, pp. 34-35].
The $\alpha$-mixing coefficient was introduced by Rosenblatt [10]. According to Doukhan [4, p. 5],
Kolmogorov introduced the $\beta$-mixing coefficient, but it appeared in print for the first time in
a paper published by some other authors. The $\phi$-mixing coefficient was introduced by Ibragimov
[6].

Essentially, all notions of mixing try to quantify the idea that, in a stationary stochastic process
of the form $\{X_t\}_{t=-\infty}^{\infty}$, the random variables $X_t$ and $X_\tau$ become more and more independent as
$|t - \tau|$ approaches infinity; in other words, there is an asymptotic long-term near-independence.
However, these very general notions can be simplified and readily adapted to define mixing
coefficients between a pair of random variables X and Y.¹ Though they can be defined for
arbitrary random variables, in the interests of avoiding a lot of technicalities we restrict our
attention in this paper to just two practically important cases: real-valued and discrete random
variables. We first define mixing coefficients between real-valued random variables, and then
between discrete random variables.

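Definition 1: Suppose X and Y are real-valued random variables. Let $\mathcal{B}$ denote the Borel
$\sigma$-algebra of subsets of $\mathbb{R}$. Then we define

$$\alpha(X,Y) := \sup_{S,T \in \mathcal{B}} \left| \Pr\{X \in S \,\&\, Y \in T\} - \Pr\{X \in S\} \cdot \Pr\{Y \in T\} \right|, \tag{2}$$

$$\phi(X|Y) := \sup_{S,T \in \mathcal{B}} \left| \Pr\{X \in S \,|\, Y \in T\} - \Pr\{X \in S\} \right|
= \sup_{S,T \in \mathcal{B}} \left| \frac{\Pr\{X \in S \,\&\, Y \in T\}}{\Pr\{Y \in T\}} - \Pr\{X \in S\} \right|. \tag{3}$$

In applying the above definition, in case $\Pr\{Y \in T\} = 0$, we use the standard convention
that

$$\Pr\{X \in S \,|\, Y \in T\} = \Pr\{X \in S\}.$$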

¹ Strictly speaking, mixing is a property not of the random variables X and Y, but rather of the $\sigma$-algebras generated by X
and Y. This is how they are defined in [4].

Note that the $\alpha$-mixing coefficient is symmetric: $\alpha(X,Y) = \alpha(Y,X)$. However, in general

$$\phi(X|Y) \neq \phi(Y|X).$$

The third coefficient, called the $\beta$-mixing coefficient, has a somewhat more elaborate definition,
at least in the general case. Let $\theta$ denote the probability measure of the joint random variable
(X,Y), and let $\mu, \nu$ denote the marginal measures of X and Y respectively. Note that $\theta$ is a
measure on $\mathbb{R}^2$ while $\mu, \nu$ are measures on $\mathbb{R}$. If X and Y were independent, then $\theta$ would equal
$\mu \times \nu$, the product measure. With this in mind, we define

$$\beta(X,Y) = \rho(\theta, \mu \times \nu), \tag{4}$$

where $\rho$ denotes the total variation distance between two measures. The $\beta$-mixing coefficient is
also symmetric.

Next we deal with discrete random variables, and for this purpose we introduce some notation
that is used throughout the remainder of the paper. The most important notational change is that,
since probability distributions on finite sets are vectors, we use bold-face Greek letters to denote
them, whereas we use normal Greek letters to denote measures on $\mathbb{R}$ or $\mathbb{R}^2$. For each integer n,
let $\mathbb{S}_n$ denote the n-dimensional simplex. Thus

$$\mathbb{S}_n := \left\{ \mathbf{v} \in \mathbb{R}^n : v_i \geq 0 \ \forall i, \ \sum_{i=1}^n v_i = 1 \right\}.$$

If $A = \{a_1, \ldots, a_n\}$ and $\boldsymbol{\mu} \in \mathbb{S}_n$, then $\boldsymbol{\mu}$ defines a measure $P_{\boldsymbol{\mu}}$ on the set A according to

$$P_{\boldsymbol{\mu}}(S) = \sum_{i=1}^n \mu_i I_S(a_i),$$

where $I_S(\cdot)$ denotes the indicator function of S. To avoid more notation, we will write $\boldsymbol{\mu}(S)$
instead of the more precise $P_{\boldsymbol{\mu}}(S)$.

Suppose $\boldsymbol{\mu}, \boldsymbol{\nu} \in \mathbb{S}_n$ are probability distributions on a set A of cardinality n. Then the total
variation distance between $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$ is defined as

$$\rho(\boldsymbol{\mu}, \boldsymbol{\nu}) := \max_{S \subseteq A} |\boldsymbol{\mu}(S) - \boldsymbol{\nu}(S)|.$$

It is easy to give several equivalent closed-form formulas for the total variation distance:

$$\rho(\boldsymbol{\mu}, \boldsymbol{\nu}) = 0.5 \|\boldsymbol{\mu} - \boldsymbol{\nu}\|_1 = \sum_{i=1}^n (\mu_i - \nu_i)_+ = -\sum_{i=1}^n (\mu_i - \nu_i)_-,$$

where as usual $(\cdot)_+$ and $(\cdot)_-$ denote the nonnegative and the nonpositive parts of a number:

$$(x)_+ = \max\{x, 0\}, \quad (x)_- = \min\{x, 0\}.$$
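The equivalence of these formulas can be checked numerically. The following is a minimal sketch in Python; the function names and the two example vectors are illustrative assumptions, not part of the paper. It evaluates the half-$\ell_1$ form, the positive-part form, the negative-part form, and a brute-force maximization over all subsets S.

```python
from itertools import combinations
from typing import Sequence


def tv_half_l1(mu: Sequence[float], nu: Sequence[float]) -> float:
    """Total variation distance as half the l1-distance."""
    return 0.5 * sum(abs(p - q) for p, q in zip(mu, nu))


def tv_positive_part(mu: Sequence[float], nu: Sequence[float]) -> float:
    """Sum of the nonnegative parts (mu_i - nu_i)_+."""
    return sum(max(p - q, 0.0) for p, q in zip(mu, nu))


def tv_negative_part(mu: Sequence[float], nu: Sequence[float]) -> float:
    """Minus the sum of the nonpositive parts (mu_i - nu_i)_-."""
    return -sum(min(p - q, 0.0) for p, q in zip(mu, nu))


def tv_bruteforce(mu: Sequence[float], nu: Sequence[float]) -> float:
    """Maximize |mu(S) - nu(S)| over all subsets S (exponential in n)."""
    n = len(mu)
    best = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            diff = sum(mu[i] - nu[i] for i in subset)
            best = max(best, abs(diff))
    return best


# Hypothetical example distributions on a three-element set.
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
values = [f(mu, nu) for f in
          (tv_half_l1, tv_positive_part, tv_negative_part, tv_bruteforce)]
print(values)  # the four numbers agree up to floating-point rounding
```

The brute-force version is included only as a cross-check; the closed-form expressions need a single pass over the n entries.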
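Now suppose A, B denote sets of cardinality n, m respectively, and that $\boldsymbol{\mu} \in \mathbb{S}_n$, $\boldsymbol{\nu} \in \mathbb{S}_m$.
Then the distribution $\boldsymbol{\psi} \in \mathbb{S}_{nm}$ defined by $\psi_{ij} = \mu_i \nu_j$ is called the product distribution on
$A \times B$. In the other direction, if $\boldsymbol{\theta} \in \mathbb{S}_{nm}$ is a distribution on $A \times B$, then $\boldsymbol{\theta}_A \in \mathbb{S}_n$, $\boldsymbol{\theta}_B \in \mathbb{S}_m$
defined respectively by

$$(\boldsymbol{\theta}_A)_i := \sum_{j=1}^m \theta_{ij}, \quad (\boldsymbol{\theta}_B)_j := \sum_{i=1}^n \theta_{ij}$$

are called the marginal distributions of $\boldsymbol{\theta}$ on A and B respectively.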

The earlier definitions of mixing coefficients become quite explicit in the case where X, Y are 
discrete random variables assuming values in the finite sets A, B of cardinalities n, m respectively. 
In this case it does not matter whether the ranges of X, Y are finite subsets of $\mathbb{R}$ or some abstract
finite sets. Definition 1 can now be restated in this context. Note that, since A, B are finite sets,
the associated $\sigma$-algebras are just the power sets, that is, the collection of all subsets.

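Definition 2: With the above notation, we define

$$\alpha(X,Y) := \max_{S \subseteq A, T \subseteq B} |\boldsymbol{\theta}(S \times T) - \boldsymbol{\mu}(S) \boldsymbol{\nu}(T)|, \tag{5}$$

$$\beta(X,Y) := \rho(\boldsymbol{\theta}, \boldsymbol{\mu} \times \boldsymbol{\nu}), \tag{6}$$

$$\phi(X|Y) := \max_{S \subseteq A, T \subseteq B} \left| \frac{\boldsymbol{\theta}(S \times T)}{\boldsymbol{\nu}(T)} - \boldsymbol{\mu}(S) \right|. \tag{7}$$

Whether X, Y are real-valued or discrete random variables, the mixing coefficients satisfy the
following inequalities:

$$0 \leq \alpha(X,Y) \leq \beta(X,Y) \leq \min\{\phi(X|Y), \phi(Y|X)\} \leq \max\{\phi(X|Y), \phi(Y|X)\} \leq 1.$$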

Also, the following statements are equivalent: 

1) X and Y are independent random variables. 

2) $\alpha(X,Y) = 0$.

3) $\beta(X,Y) = 0$.

4) $\phi(X|Y) = 0$.

5) $\phi(Y|X) = 0$.
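For very small alphabets, Definition 2 can be evaluated directly by enumerating every pair of subsets. The sketch below does this in Python. The function names and the example joint distribution are illustrative assumptions, and the enumeration costs $2^{n+m}$ evaluations, so it is meant only as a sanity check of the inequality chain and of the independence criterion.

```python
from itertools import product
from typing import List, Tuple

Matrix = List[List[float]]


def marginals(theta: Matrix) -> Tuple[List[float], List[float]]:
    """Row sums give the marginal of X, column sums the marginal of Y."""
    mu = [sum(row) for row in theta]
    nu = [sum(col) for col in zip(*theta)]
    return mu, nu


def beta_mixing(theta: Matrix) -> float:
    """Total variation distance between theta and the product of its marginals."""
    mu, nu = marginals(theta)
    return 0.5 * sum(abs(theta[i][j] - mu[i] * nu[j])
                     for i in range(len(mu)) for j in range(len(nu)))


def alpha_phi_bruteforce(theta: Matrix) -> Tuple[float, float]:
    """Return (alpha(X,Y), phi(X|Y)) by enumerating all subsets S and T."""
    mu, nu = marginals(theta)
    n, m = len(mu), len(nu)
    alpha = phi = 0.0
    for a in product((0, 1), repeat=n):        # indicator vector of S
        mu_s = sum(mu[i] for i in range(n) if a[i])
        for b in product((0, 1), repeat=m):    # indicator vector of T
            nu_t = sum(nu[j] for j in range(m) if b[j])
            th = sum(theta[i][j] for i in range(n) if a[i]
                     for j in range(m) if b[j])
            alpha = max(alpha, abs(th - mu_s * nu_t))
            if nu_t > 0:  # when nu(T) = 0 the convention makes the term zero
                phi = max(phi, abs(th / nu_t - mu_s))
    return alpha, phi


def transpose(theta: Matrix) -> Matrix:
    return [list(col) for col in zip(*theta)]


# Hypothetical 2 x 2 joint distribution.
theta_example = [[0.3, 0.1], [0.1, 0.5]]
alpha_xy, phi_x_given_y = alpha_phi_bruteforce(theta_example)
_, phi_y_given_x = alpha_phi_bruteforce(transpose(theta_example))
beta_xy = beta_mixing(theta_example)
print(alpha_xy, beta_xy, phi_x_given_y, phi_y_given_x)
```

For this example the values are approximately 0.14, 0.28, 0.35 and 0.35, consistent with the chain of inequalities above; replacing the joint distribution by a product of its marginals drives all of them to zero up to rounding.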

III. Computation of Mixing Coefficients for Discrete Random Variables

In this section, we present explicit upper and lower bounds for the $\alpha$- and $\phi$-mixing coefficients,
as well as an exact formula for the $\phi$-mixing coefficient in the case where one of the random
variables has a uniform marginal distribution. This situation arises when a real-valued random
variable is quantized using percentile binning.

From the definitions, it is obvious that $\beta(X,Y)$ can be readily computed in closed form. As
before, let us define $\boldsymbol{\psi} = \boldsymbol{\mu} \times \boldsymbol{\nu}$ to be the product distribution of the two marginals, and define

$$\Gamma := [\gamma_{ij}] \in \mathbb{R}^{n \times m}, \quad \gamma_{ij} := \theta_{ij} - \psi_{ij} = \theta_{ij} - \mu_i \nu_j.$$

Then it is obvious that

$$\beta(X,Y) := \rho(\boldsymbol{\theta}, \boldsymbol{\psi}) = 0.5 \sum_{i=1}^n \sum_{j=1}^m |\gamma_{ij}| = \sum_{i=1}^n \sum_{j=1}^m (\gamma_{ij})_+ = -\sum_{i=1}^n \sum_{j=1}^m (\gamma_{ij})_-.$$

On the other hand, computing $\alpha(X,Y)$ or $\phi(X|Y)$ directly from Definition 2 would require
$2^{n+m}$ computations, since S, T must be allowed to vary over all subsets of A, B respectively. It
is shown later that the number of computations can be brought down to $O(2^m)$, but this is still
exponential. Thus the objectives of the present section are to derive explicit upper and lower
bounds for these mixing coefficients, and also to derive an exact formula for $\phi(X|Y)$ in case $\boldsymbol{\nu}$
is the uniform distribution.

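For this purpose we recall the definition of the matrix induced norm. For indices i and j, let
$\boldsymbol{\gamma}^i, \boldsymbol{\gamma}_j$ denote respectively the i-th row and j-th column of the matrix $\Gamma$. The quantity

$$\|\Gamma\|_{i1} := \max_{1 \leq j \leq m} \sum_{i=1}^n |\gamma_{ij}| = \max_{1 \leq j \leq m} \|\boldsymbol{\gamma}_j\|_1$$

is called the $\ell_1$-induced matrix norm of $\Gamma$. It is well-known that

$$\|\Gamma\|_{i1} = \max_{\|\mathbf{v}\|_1 \leq 1} \|\Gamma \mathbf{v}\|_1 = \max_{\mathbf{v} \neq \mathbf{0}} \frac{\|\Gamma \mathbf{v}\|_1}{\|\mathbf{v}\|_1}.$$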

With this notation we are ready to state the main results of this section.
Theorem 1: We have that

$$0.5 \|\Gamma\|_{i1} \leq \alpha(X,Y) \leq 0.25 m \|\Gamma\|_{i1}. \tag{8}$$

Theorem 2: We have that

$$\frac{0.5 \|\Gamma\|_{i1}}{\max_j \nu_j} \leq \phi(X|Y) \leq \frac{0.5 \|\Gamma\|_{i1}}{\min_j \nu_j}. \tag{9}$$

In particular, if $\boldsymbol{\nu}$ is the uniform distribution on B, then

$$\phi(X|Y) = 0.5 m \|\Gamma\|_{i1}. \tag{10}$$
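The bounds in (8) and (9) need only the column sums of $|\Gamma|$, so they are cheap to evaluate. The following Python sketch computes them; the function names and the example matrix are illustrative assumptions, and the upper bound in (9) presumes that every $\nu_j$ is strictly positive.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def gamma_and_nu(theta: Matrix) -> Tuple[Matrix, List[float]]:
    """Return Gamma = theta - mu x nu together with the marginal nu of Y."""
    mu = [sum(row) for row in theta]
    nu = [sum(col) for col in zip(*theta)]
    gamma = [[theta[i][j] - mu[i] * nu[j] for j in range(len(nu))]
             for i in range(len(mu))]
    return gamma, nu


def induced_l1_norm(gamma: Matrix) -> float:
    """Largest column sum of absolute values."""
    return max(sum(abs(x) for x in col) for col in zip(*gamma))


def alpha_bounds(theta: Matrix) -> Tuple[float, float]:
    """Lower and upper bounds on alpha(X,Y) from (8)."""
    gamma, nu = gamma_and_nu(theta)
    norm = induced_l1_norm(gamma)
    return 0.5 * norm, 0.25 * len(nu) * norm


def phi_bounds(theta: Matrix) -> Tuple[float, float]:
    """Lower and upper bounds on phi(X|Y) from (9); assumes min(nu) > 0."""
    gamma, nu = gamma_and_nu(theta)
    norm = induced_l1_norm(gamma)
    return 0.5 * norm / max(nu), 0.5 * norm / min(nu)


# Hypothetical joint distribution whose Y-marginal is uniform, so the two
# bounds on phi(X|Y) coincide and give the exact value in (10).
theta_uniform = [[0.4, 0.1], [0.1, 0.4]]
phi_lower, phi_upper = phi_bounds(theta_uniform)
alpha_lower, alpha_upper = alpha_bounds(theta_uniform)
print(phi_lower, phi_upper, alpha_lower, alpha_upper)
```

In this example both bounds on $\phi(X|Y)$ are approximately 0.3, which matches the direct evaluation $0.4/0.5 - 0.5$ for the first row and first column.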

In proving Theorems 1 and 2, the first step is to get rid of the absolute value signs in the
definitions of the $\alpha$- and $\phi$-mixing coefficients.
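Theorem 3: It is the case that

$$\alpha(X,Y) = \max_{S \subseteq A, T \subseteq B} [\boldsymbol{\theta}(S \times T) - \boldsymbol{\mu}(S) \boldsymbol{\nu}(T)], \tag{11}$$

$$\phi(X|Y) = \max_{S \subseteq A, T \subseteq B} \left[ \frac{\boldsymbol{\theta}(S \times T)}{\boldsymbol{\nu}(T)} - \boldsymbol{\mu}(S) \right]. \tag{12}$$

Proof: Define

$$\mathcal{R}_\alpha := \{ \boldsymbol{\theta}(S \times T) - \boldsymbol{\mu}(S) \boldsymbol{\nu}(T) : S \subseteq A, T \subseteq B \}.$$

Then $\mathcal{R}_\alpha$ is a subset of the real line consisting of at most $2^{n+m}$ elements. Now it is claimed that
the set $\mathcal{R}_\alpha$ is symmetric; that is, $x \in \mathcal{R}_\alpha$ implies that $-x \in \mathcal{R}_\alpha$. If this claim can be established,
then (11) follows readily. So suppose $x \in \mathcal{R}_\alpha$, and choose $S \subseteq A$, $T \subseteq B$ such that

$$\boldsymbol{\theta}(S \times T) - \boldsymbol{\mu}(S) \boldsymbol{\nu}(T) = x.$$

Let $S^c$ denote the complement of S in A. Then, using the facts that

$$\boldsymbol{\mu}(S^c) = 1 - \boldsymbol{\mu}(S),$$

$$\boldsymbol{\theta}(S^c \times T) = \boldsymbol{\theta}(A \times T) - \boldsymbol{\theta}(S \times T) = \boldsymbol{\nu}(T) - \boldsymbol{\theta}(S \times T),$$

it is easy to verify that

$$\boldsymbol{\theta}(S^c \times T) - \boldsymbol{\mu}(S^c) \boldsymbol{\nu}(T) = -x.$$

So $\mathcal{R}_\alpha$ is symmetric and (11) follows. By analogous reasoning, the set

$$\mathcal{R}_\phi := \left\{ \frac{\boldsymbol{\theta}(S \times T)}{\boldsymbol{\nu}(T)} - \boldsymbol{\mu}(S) : S \subseteq A, T \subseteq B \right\}$$

is also symmetric, which establishes (12). ■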
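To facilitate the proofs of Theorems 1 and 2, we introduce a map $\mathbf{h}$ from the power set of A
into $\{0,1\}^n$. For a subset $S \subseteq A$, we define $\mathbf{h}(S) \in \{0,1\}^n$ by

$$h_i(S) = \begin{cases} 1, & \text{if } a_i \in S, \\ 0, & \text{if } a_i \notin S. \end{cases}$$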
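The map $\mathbf{h} : 2^B \to \{0,1\}^m$ is defined analogously. With these definitions, it is obvious that, for
$S \subseteq A$, $T \subseteq B$, we have

$$\boldsymbol{\mu}(S) = [\mathbf{h}(S)]^t \boldsymbol{\mu} = \boldsymbol{\mu}^t \mathbf{h}(S), \quad \boldsymbol{\nu}(T) = [\mathbf{h}(T)]^t \boldsymbol{\nu} = \boldsymbol{\nu}^t \mathbf{h}(T),$$

$$\boldsymbol{\theta}(S \times T) = [\mathbf{h}(S)]^t \Theta \, \mathbf{h}(T),$$

where $\Theta = [\theta_{ij}]$. By replacing $\mathbf{h}(S)$ and $\mathbf{h}(T)$ by arbitrary binary vectors $\mathbf{a} \in \{0,1\}^n$, $\mathbf{b} \in
\{0,1\}^m$, it readily follows from (11) and (12) that

$$\alpha(X,Y) = \max_{\mathbf{a} \in \{0,1\}^n, \mathbf{b} \in \{0,1\}^m} \mathbf{a}^t \Gamma \mathbf{b}, \tag{13}$$

$$\phi(X|Y) = \max_{\mathbf{a} \in \{0,1\}^n, \mathbf{b} \in \{0,1\}^m} \frac{\mathbf{a}^t \Gamma \mathbf{b}}{\boldsymbol{\nu}^t \mathbf{b}}. \tag{14}$$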

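Now we are in a position to prove Theorems 1 and 2.

Proof of Theorem 1: It is obvious that

$$\max_{\mathbf{a} \in \{0,1\}^n, \mathbf{b} \in \{0,1\}^m} \mathbf{a}^t \Gamma \mathbf{b} = \max_{\mathbf{b} \in \{0,1\}^m} \max_{\mathbf{a} \in \{0,1\}^n} \mathbf{a}^t \Gamma \mathbf{b}.$$

Now, for fixed $\mathbf{b} \in \{0,1\}^m$, it is obvious that

$$\max_{\mathbf{a} \in \{0,1\}^n} \mathbf{a}^t \Gamma \mathbf{b} = \sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{b})_+,$$

corresponding to the choice

$$a_i = \begin{cases} 1, & \text{if } \boldsymbol{\gamma}^i \mathbf{b} \geq 0, \\ 0, & \text{if } \boldsymbol{\gamma}^i \mathbf{b} < 0. \end{cases}$$

Therefore

$$\alpha(X,Y) = \max_{\mathbf{b} \in \{0,1\}^m} \sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{b})_+. \tag{15}$$

Next, let $\mathbf{e}$ denote a column vector consisting of all ones, with the subscript denoting its
dimension, and observe that, with $\Psi = [\psi_{ij}]$,

$$\mathbf{e}_n^t \Theta = \mathbf{e}_n^t \Psi = \boldsymbol{\nu}^t \implies \mathbf{e}_n^t \Gamma = \mathbf{0}_m^t, \quad \text{similarly } \Gamma \mathbf{e}_m = \mathbf{0}_n.$$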
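Therefore, for any vector $\mathbf{v} \in \mathbb{R}^m$, it follows that

$$\begin{aligned}
\mathbf{e}_n^t \Gamma \mathbf{v} = 0 &\implies \sum_{i=1}^n \boldsymbol{\gamma}^i \mathbf{v} = 0 \\
&\implies \sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{v})_+ + \sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{v})_- = 0 \\
&\implies \sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{v})_+ = -\sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{v})_- \\
&\implies \sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{v})_+ = 0.5 \sum_{i=1}^n |\boldsymbol{\gamma}^i \mathbf{v}| = 0.5 \|\Gamma \mathbf{v}\|_1.
\end{aligned} \tag{16}$$

So in particular it follows that

$$\alpha(X,Y) = \max_{\mathbf{b} \in \{0,1\}^m} 0.5 \|\Gamma \mathbf{b}\|_1. \tag{17}$$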

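To prove the lower bound in (8), choose an index $j_0 \in \{1, \ldots, m\}$ such that $\|\boldsymbol{\gamma}_{j_0}\|_1 = \|\Gamma\|_{i1}$.
Then choose $\mathbf{b}_0 \in \{0,1\}^m$ to be the binary vector with $b_{j_0} = 1$ and $b_j = 0$ for all $j \neq j_0$. Now
it follows from (16) that

$$\sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{b}_0)_+ = 0.5 \sum_{i=1}^n |\boldsymbol{\gamma}^i \mathbf{b}_0| = 0.5 \sum_{i=1}^n |\gamma_{i j_0}| = 0.5 \|\boldsymbol{\gamma}_{j_0}\|_1 = 0.5 \|\Gamma\|_{i1}.$$

Hence the maximum over all $\mathbf{b} \in \{0,1\}^m$ is at least equal to this much.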
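To prove the upper bound in (8), observe from the definition that

$$\|\Gamma\|_{i1} = \max_{1 \leq j \leq m} \|\boldsymbol{\gamma}_j\|_1.$$

Now for any $\mathbf{b} \in \{0,1\}^m$, we have

$$\Gamma \mathbf{b} = \sum_{j=1}^m \boldsymbol{\gamma}_j b_j.$$

Therefore

$$0.5 \|\Gamma \mathbf{b}\|_1 = 0.5 \left\| \sum_{j=1}^m \boldsymbol{\gamma}_j b_j \right\|_1 \leq 0.5 \left( \sum_{j=1}^m b_j \right) \max_{1 \leq j \leq m} \|\boldsymbol{\gamma}_j\|_1. \tag{18}$$

The proof is completed by showing that an optimal $\mathbf{b}$ can be chosen with no more than m/2
nonzero entries. Choose a $\mathbf{b}^* \in \{0,1\}^m$ that achieves the maximum in (17). If $\mathbf{b}^*$ has m/2 or
fewer nonzero entries, we are done, because we can substitute into (18) and conclude that

$$0.5 \|\Gamma \mathbf{b}^*\|_1 \leq 0.25 m \max_{1 \leq j \leq m} \|\boldsymbol{\gamma}_j\|_1 = 0.25 m \|\Gamma\|_{i1}.$$

If $\mathbf{b}^*$ has more than m/2 nonzero entries, define $\mathbf{b} = \mathbf{e}_m - \mathbf{b}^*$, and note that $\mathbf{b}$ has fewer than
m/2 nonzero entries. Also, since $\Gamma \mathbf{e}_m = \mathbf{0}_n$, it follows that $\Gamma \mathbf{b} = -\Gamma \mathbf{b}^*$. So by earlier reasoning

$$\|\Gamma \mathbf{b}\|_1 = \|\Gamma \mathbf{b}^*\|_1$$

and the bound follows. ■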
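Formula (17), together with the complement argument at the end of the proof, suggests a way to evaluate $\alpha(X,Y)$ exactly by enumerating only binary vectors $\mathbf{b}$ with at most m/2 ones. The Python sketch below is one possible rendering; the function names and the example joint distribution are illustrative assumptions. It compares the full enumeration over $\{0,1\}^m$ with the reduced one.

```python
from itertools import combinations, product
from typing import List

Matrix = List[List[float]]


def gamma_matrix(theta: Matrix) -> Matrix:
    """Gamma = theta minus the product of the two marginals."""
    mu = [sum(row) for row in theta]
    nu = [sum(col) for col in zip(*theta)]
    return [[theta[i][j] - mu[i] * nu[j] for j in range(len(nu))]
            for i in range(len(mu))]


def half_l1_of_gamma_b(gamma: Matrix, support) -> float:
    """0.5 * ||Gamma b||_1 where b is the indicator vector of `support`."""
    return 0.5 * sum(abs(sum(row[j] for j in support)) for row in gamma)


def alpha_full(theta: Matrix) -> float:
    """Evaluate (17) over all 2^m binary vectors b."""
    gamma = gamma_matrix(theta)
    m = len(gamma[0])
    best = 0.0
    for b in product((0, 1), repeat=m):
        support = [j for j in range(m) if b[j]]
        best = max(best, half_l1_of_gamma_b(gamma, support))
    return best


def alpha_half(theta: Matrix) -> float:
    """Evaluate (17) over vectors b with at most m // 2 ones only."""
    gamma = gamma_matrix(theta)
    m = len(gamma[0])
    best = 0.0
    for size in range(m // 2 + 1):
        for support in combinations(range(m), size):
            best = max(best, half_l1_of_gamma_b(gamma, support))
    return best


# Hypothetical 2 x 3 joint distribution.
theta_example = [[0.2, 0.1, 0.1], [0.1, 0.2, 0.3]]
alpha_all = alpha_full(theta_example)
alpha_reduced = alpha_half(theta_example)
print(alpha_all, alpha_reduced)
```

Both evaluations give approximately 0.08 for this example, which lies between the bounds $0.5\|\Gamma\|_{i1} = 0.08$ and $0.25 m \|\Gamma\|_{i1} = 0.12$ of Theorem 1. The reduced enumeration roughly halves the work but remains exponential in m.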
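Proof of Theorem 2: By reasoning analogous to that in the proof of Theorem 1, we arrive at

$$\begin{aligned}
\phi(X|Y) &= \max_{\mathbf{a} \in \{0,1\}^n, \mathbf{b} \in \{0,1\}^m} \frac{\mathbf{a}^t \Gamma \mathbf{b}}{\boldsymbol{\nu}^t \mathbf{b}} \\
&= \max_{\mathbf{b} \in \{0,1\}^m} \max_{\mathbf{a} \in \{0,1\}^n} \frac{\mathbf{a}^t \Gamma \mathbf{b}}{\boldsymbol{\nu}^t \mathbf{b}} \\
&= \max_{\mathbf{b} \in \{0,1\}^m} \frac{1}{\boldsymbol{\nu}^t \mathbf{b}} \sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{b})_+.
\end{aligned} \tag{19}$$

To prove the lower bound, choose an index $j_0$ such that $\|\boldsymbol{\gamma}_{j_0}\|_1 = \|\Gamma\|_{i1}$, and choose $\mathbf{b}_0 \in
\{0,1\}^m$ such that $b_{j_0} = 1$ and $b_j = 0$ for all $j \neq j_0$. Then

$$\frac{1}{\boldsymbol{\nu}^t \mathbf{b}_0} \sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{b}_0)_+ = \frac{0.5 \|\Gamma\|_{i1}}{\nu_{j_0}} \geq \frac{0.5 \|\Gamma\|_{i1}}{\max_j \nu_j}.$$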



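To prove the upper bound, note that for all $\mathbf{b} \in \{0,1\}^m$, we have

$$\frac{1}{\boldsymbol{\nu}^t \mathbf{b}} \sum_{i=1}^n (\boldsymbol{\gamma}^i \mathbf{b})_+ = \frac{0.5 \|\Gamma \mathbf{b}\|_1}{\boldsymbol{\nu}^t \mathbf{b}}, \quad \boldsymbol{\nu}^t \mathbf{b} = \sum_{j=1}^m \nu_j b_j = \|\mathrm{Diag}(\boldsymbol{\nu}) \mathbf{b}\|_1.$$

Now we change the variable of optimization from $\mathbf{b}$ to $\mathbf{v} := \mathrm{Diag}(\boldsymbol{\nu}) \mathbf{b}$, and use the fact that
the induced matrix norm $\| \cdot \|_{i1}$ is submultiplicative. This leads to

$$\begin{aligned}
\phi(X|Y) &= 0.5 \max_{\mathbf{b} \in \{0,1\}^m} \frac{\|\Gamma \mathbf{b}\|_1}{\boldsymbol{\nu}^t \mathbf{b}} \\
&\leq 0.5 \max_{\mathbf{b} \in \mathbb{R}^m, \mathbf{b} \neq \mathbf{0}} \frac{\|\Gamma \mathbf{b}\|_1}{\|\mathrm{Diag}(\boldsymbol{\nu}) \mathbf{b}\|_1} \\
&= 0.5 \max_{\mathbf{v} \in \mathbb{R}^m, \mathbf{v} \neq \mathbf{0}} \frac{\|\Gamma [\mathrm{Diag}(\boldsymbol{\nu})]^{-1} \mathbf{v}\|_1}{\|\mathbf{v}\|_1} \\
&= 0.5 \|\Gamma [\mathrm{Diag}(\boldsymbol{\nu})]^{-1}\|_{i1} \\
&\leq 0.5 \|\Gamma\|_{i1} \cdot \|[\mathrm{Diag}(\boldsymbol{\nu})]^{-1}\|_{i1} \\
&= \frac{0.5 \|\Gamma\|_{i1}}{\min_j \nu_j}.
\end{aligned}$$

Finally, if $\boldsymbol{\nu}$ is the uniform distribution, then $\min_j \nu_j = \max_j \nu_j = 1/m$. So the two inequalities
in (9) become equalities. ■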

For later use, we collect (15) and (19) and state them as a separate theorem.

Theorem 4: With all notation as above, we have

$$\alpha(X,Y) = \max_{T \subseteq B} \sum_{i=1}^n [\Pr\{X = i \,\&\, Y \in T\} - \Pr\{X = i\} \Pr\{Y \in T\}]_+, \tag{20}$$

and

$$\phi(X|Y) = \max_{T \subseteq B} \sum_{i=1}^n [\Pr\{X = i \,|\, Y \in T\} - \Pr\{X = i\}]_+. \tag{21}$$
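Formula (21) translates into an enumeration over subsets T only, with the inner sum taken over the n values of X. A minimal Python sketch follows; the function name and the example joint distribution are illustrative assumptions, and subsets T of zero probability are skipped in line with the convention stated after Definition 1.

```python
from itertools import product
from typing import List

Matrix = List[List[float]]


def phi_mixing(theta: Matrix) -> float:
    """phi(X|Y) via (21): theta[i][j] = Pr{X = i & Y = j}."""
    n, m = len(theta), len(theta[0])
    mu = [sum(row) for row in theta]          # Pr{X = i}
    nu = [sum(col) for col in zip(*theta)]    # Pr{Y = j}
    best = 0.0
    for b in product((0, 1), repeat=m):       # indicator vector of T
        nu_t = sum(nu[j] for j in range(m) if b[j])
        if nu_t == 0:
            continue  # conditional probability equals Pr{X = i} by convention
        total = 0.0
        for i in range(n):
            joint = sum(theta[i][j] for j in range(m) if b[j])
            total += max(joint / nu_t - mu[i], 0.0)
        best = max(best, total)
    return best


def transpose(theta: Matrix) -> Matrix:
    return [list(col) for col in zip(*theta)]


# Hypothetical 2 x 3 joint distribution.
theta_example = [[0.2, 0.1, 0.1], [0.1, 0.2, 0.3]]
phi_x_given_y = phi_mixing(theta_example)
phi_y_given_x = phi_mixing(transpose(theta_example))
print(phi_x_given_y, phi_y_given_x)
```

For this example $\phi(X|Y)$ is approximately 0.267 while $\phi(Y|X)$ is approximately 0.2, which illustrates that the $\phi$-mixing coefficient is not symmetric.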

IV. Data Processing-Type Inequalities for Mixing Coefficients 

In this section we study the case where two random variables are conditionally independent 
given a third, and prove inequalities of the data processing-type for the associated mixing 
coefficients. The nomenclature 'data processing-type' is motivated by the well-known data 
processing inequality in information theory. 

Definition 3: Suppose X, Y, Z are discrete random variables assuming values in finite sets
A, B, C respectively. Then X, Z are said to be conditionally independent given Y if

$$\Pr\{X = i \,\&\, Z = k \,|\, Y = j\} = \Pr\{X = i \,|\, Y = j\} \Pr\{Z = k \,|\, Y = j\}, \quad \forall i \in A, j \in B, k \in C. \tag{22}$$

If X, Z are conditionally independent given Y, we denote this by $(X \perp Z)|Y$. Some authors
also write this as '$X \to Y \to Z$ is a short Markov chain', ignoring the fact that the three random
variables can belong to quite distinct sets. In this case, it makes no difference whether we write
$X \to Y \to Z$ or $Z \to Y \to X$, because it is obvious from (22) that conditional independence
is a symmetric relationship. Thus

$$(X \perp Z)|Y \iff (Z \perp X)|Y.$$

Also, from the definition, it follows readily that if $(X \perp Z)|Y$, then

$$\Pr\{X \in S \,\&\, Z \in U \,|\, Y = j\} = \Pr\{X \in S \,|\, Y = j\} \Pr\{Z \in U \,|\, Y = j\}, \quad \forall S \subseteq A, j \in B, U \subseteq C. \tag{23}$$

However, in general, it is not true that

$$\Pr\{X \in S \,\&\, Z \in U \,|\, Y \in T\} = \Pr\{X \in S \,|\, Y \in T\} \Pr\{Z \in U \,|\, Y \in T\}, \quad \forall S \subseteq A, T \subseteq B, U \subseteq C.$$

In fact, by setting T = B, it would follow from the above relationship that X and Z are
independent, which is a stronger requirement than conditional independence.

Given two random variables X, Y with joint distribution $\boldsymbol{\theta}$ and marginal distributions $\boldsymbol{\mu}, \boldsymbol{\nu}$ of
X, Y respectively, the quantity

$$H(\boldsymbol{\mu}) := -\sum_{i=1}^n \mu_i \log \mu_i$$

is called the entropy of $\boldsymbol{\mu}$, with analogous definitions for $H(\boldsymbol{\nu})$ and $H(\boldsymbol{\theta})$; and the quantity

$$I(X,Y) = H(\boldsymbol{\mu}) + H(\boldsymbol{\nu}) - H(\boldsymbol{\theta})$$

is called the mutual information between X and Y. It is clear that $I(X,Y) = I(Y,X)$. The
following well-known inequality, referred to as the data-processing inequality, is the motivation
for the contents of this section; see [3, p. 34]. Suppose $(X \perp Z)|Y$. Then

$$I(X,Z) \leq \min\{I(X,Y), I(Y,Z)\}. \tag{24}$$

Theorem 5: Suppose $(X \perp Z)|Y$. Then

$$\alpha(X,Z) \leq \min\{\alpha(X,Y), \alpha(Y,Z)\}. \tag{25}$$

Theorem 6: Suppose $(X \perp Z)|Y$. Then

$$\beta(X,Z) \leq \min\{\beta(X,Y), \beta(Y,Z)\}. \tag{26}$$

Theorem 7: Suppose $(X \perp Z)|Y$. Then

$$\phi(X|Z) \leq \min\{\phi(X|Y), \phi(Y|Z)\}, \tag{27}$$

$$\phi(Z|X) \leq \min\{\phi(Z|Y), \phi(Y|X)\}. \tag{28}$$
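The inequality (26) is easy to illustrate numerically, because the $\beta$-mixing coefficient has a closed form. The Python sketch below builds a short Markov chain $X \to Y \to Z$ from a marginal for X and two conditional probability tables, and compares the three pairwise $\beta$-mixing coefficients. The function names and all numerical values are illustrative assumptions, and a single example is of course not a substitute for the proof.

```python
from typing import List

Matrix = List[List[float]]


def beta_mixing(joint: Matrix) -> float:
    """Total variation distance between a joint law and the product of its marginals."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    return 0.5 * sum(abs(joint[i][j] - rows[i] * cols[j])
                     for i in range(len(rows)) for j in range(len(cols)))


# Hypothetical chain: Pr{X=i}, Pr{Y=j | X=i}, Pr{Z=k | Y=j}.
p_x = [0.5, 0.5]
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]
p_z_given_y = [[0.7, 0.3], [0.4, 0.6]]

n, m, l = len(p_x), len(p_y_given_x[0]), len(p_z_given_y[0])

# Joint law of (X, Y).
theta = [[p_x[i] * p_y_given_x[i][j] for j in range(m)] for i in range(n)]
nu = [sum(col) for col in zip(*theta)]

# Joint law of (Y, Z).
eta = [[nu[j] * p_z_given_y[j][k] for k in range(l)] for j in range(m)]

# Joint law of (X, Z), using conditional independence of X and Z given Y.
zeta = [[sum(theta[i][j] * p_z_given_y[j][k] for j in range(m))
         for k in range(l)] for i in range(n)]

beta_xy = beta_mixing(theta)
beta_yz = beta_mixing(eta)
beta_xz = beta_mixing(zeta)
print(beta_xy, beta_yz, beta_xz)
```

Here the three values are approximately 0.35, 0.1485 and 0.105, so $\beta(X,Z)$ is indeed no larger than either $\beta(X,Y)$ or $\beta(Y,Z)$.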
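Proof of Theorem 5: Let $S \subseteq A$, $U \subseteq C$ be arbitrary, and define

$$r_\alpha(S,U) := \Pr\{X \in S \,\&\, Z \in U\} - \Pr\{X \in S\} \Pr\{Z \in U\}.$$

Then

$$\begin{aligned}
r_\alpha(S,U) &= \sum_{j=1}^m [\Pr\{X \in S \,\&\, Y = j \,\&\, Z \in U\} - \Pr\{X \in S \,\&\, Y = j\} \Pr\{Z \in U\}] \\
&= \sum_{j=1}^m [\Pr\{X \in S \,|\, Y = j\} \Pr\{Z \in U \,|\, Y = j\} \Pr\{Y = j\} - \Pr\{X \in S \,|\, Y = j\} \Pr\{Y = j\} \Pr\{Z \in U\}] \\
&= \sum_{j=1}^m \Pr\{X \in S \,|\, Y = j\} [\Pr\{Z \in U \,\&\, Y = j\} - \Pr\{Y = j\} \Pr\{Z \in U\}] \\
&\leq \sum_{j=1}^m \Pr\{X \in S \,|\, Y = j\} [\Pr\{Z \in U \,\&\, Y = j\} - \Pr\{Y = j\} \Pr\{Z \in U\}]_+ \\
&\leq \sum_{j=1}^m [\Pr\{Z \in U \,\&\, Y = j\} - \Pr\{Y = j\} \Pr\{Z \in U\}]_+ \\
&\leq \max_{U \subseteq C} \sum_{j=1}^m [\Pr\{Z \in U \,\&\, Y = j\} - \Pr\{Y = j\} \Pr\{Z \in U\}]_+ \\
&= \alpha(Y,Z),
\end{aligned}$$

where the last step follows from (20). Since S and U are arbitrary, this implies that $\alpha(X,Z) \leq \alpha(Y,Z)$ whenever $X \to Y \to Z$ is
a short Markov chain. Since $X \to Y \to Z$ is the same as $Z \to Y \to X$, it also follows that
$\alpha(Z,X) \leq \alpha(Y,X)$. Finally, since $\alpha$ is symmetric, the desired conclusion (25) follows. ■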
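Proof of Theorem 6: Suppose that A, B, C have cardinalities n, m, l respectively. (The symbols
n, m have been introduced earlier and now l is introduced.) Let $\boldsymbol{\delta}$ denote the joint distribution
of (X, Y, Z), $\boldsymbol{\zeta}$ the joint distribution of (X, Z), $\boldsymbol{\eta}$ the joint distribution of (Y, Z), and as before,
$\boldsymbol{\theta}$ the joint distribution of (X, Y). Let $\boldsymbol{\xi}$ denote the marginal distribution of Z, and as before, let $\boldsymbol{\mu}, \boldsymbol{\nu}$
denote the marginal distributions of X and Y. Finally, define

$$c_{jk} = \frac{\eta_{jk}}{\nu_j} = \Pr\{Z = k \,|\, Y = j\}.$$

As can be easily verified, the fact that $(X \perp Z)|Y$ (or (22)) is equivalent to

$$\delta_{ijk} = \frac{\theta_{ij} \eta_{jk}}{\nu_j} = \theta_{ij} c_{jk}, \quad \forall i, j, k.$$

Also note the following identities:

$$\sum_{i=1}^n \theta_{ij} = \nu_j, \quad \sum_{j=1}^m \theta_{ij} = \mu_i, \quad \sum_{j=1}^m \delta_{ijk} = \zeta_{ik}.$$

Now it follows from the various definitions that

$$\begin{aligned}
\beta(X,Z) &= \sum_{i=1}^n \sum_{k=1}^l (\zeta_{ik} - \mu_i \xi_k)_+ \\
&= \sum_{i=1}^n \sum_{k=1}^l \left( \sum_{j=1}^m [\delta_{ijk} - \theta_{ij} \xi_k] \right)_+ \\
&\leq \sum_{i=1}^n \sum_{k=1}^l \sum_{j=1}^m (\delta_{ijk} - \theta_{ij} \xi_k)_+ \\
&= \sum_{i=1}^n \sum_{k=1}^l \sum_{j=1}^m (\theta_{ij} c_{jk} - \theta_{ij} \xi_k)_+ \\
&= \sum_{k=1}^l \sum_{j=1}^m \left( \sum_{i=1}^n \theta_{ij} \right) (c_{jk} - \xi_k)_+ \\
&= \sum_{k=1}^l \sum_{j=1}^m \nu_j (c_{jk} - \xi_k)_+ \\
&= \sum_{k=1}^l \sum_{j=1}^m (\eta_{jk} - \nu_j \xi_k)_+ \\
&= \beta(Y,Z).
\end{aligned}$$

Now the symmetry of $\beta(\cdot,\cdot)$ serves to show that $\beta(X,Z) \leq \beta(X,Y)$. Putting both inequalities
together leads to the desired conclusion. ■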
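Proof of Theorem 7: Suppose $(X \perp Z)|Y$. Since the $\phi$-mixing coefficient is not symmetric, it
is necessary to prove two distinct inequalities, namely: (i) $\phi(X|Z) \leq \phi(X|Y)$, and (ii) $\phi(X|Z) \leq
\phi(Y|Z)$.

Proof that $\phi(X|Z) \leq \phi(X|Y)$: For $S \subseteq A$, define

$$r_\phi(S) := \max_{T \subseteq B} \Pr\{X \in S \,|\, Y \in T\},$$

and observe that

$$\phi(X|Y) = \max_{S \subseteq A} [r_\phi(S) - \boldsymbol{\mu}(S)].$$

For a given $S \subseteq A$, choose $T^* = T^*(S) \subseteq B$ such that

$$\Pr\{X \in S \,|\, Y \in T^*\} = r_\phi(S).$$

Suppose $U \subseteq C$ is arbitrary. Then

$$\begin{aligned}
\Pr\{X \in S \,\&\, Z \in U\} &= \sum_{j=1}^m \Pr\{X \in S \,\&\, Y = j \,\&\, Z \in U\} \\
&= \sum_{j=1}^m \Pr\{X \in S \,|\, Y = j\} \Pr\{Z \in U \,|\, Y = j\} \Pr\{Y = j\} \\
&= \sum_{j=1}^m \Pr\{X \in S \,|\, Y = j\} \Pr\{Z \in U \,\&\, Y = j\} \\
&\leq r_\phi(S) \sum_{j=1}^m \Pr\{Z \in U \,\&\, Y = j\} \\
&= r_\phi(S) \Pr\{Z \in U\}.
\end{aligned}$$

Dividing both sides by $\Pr\{Z \in U\}$ leads to

$$\Pr\{X \in S \,|\, Z \in U\} \leq r_\phi(S),$$

$$\Pr\{X \in S \,|\, Z \in U\} - \boldsymbol{\mu}(S) \leq r_\phi(S) - \boldsymbol{\mu}(S) \leq \phi(X|Y).$$

Since S and U are arbitrary, it follows from (12) that $\phi(X|Z) \leq \phi(X|Y)$.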
Proof that $\phi(X|Z) \leq \phi(Y|Z)$: Let us define

$$c(S,U) := \Pr\{X \in S \,|\, Z \in U\} - \boldsymbol{\mu}(S),$$

and reason as follows:

$$\begin{aligned}
c(S,U) &= \Pr\{X \in S \,|\, Z \in U\} - \Pr\{X \in S\} \\
&= \sum_{j=1}^m [\Pr\{X \in S \,\&\, Y = j \,|\, Z \in U\} - \Pr\{X \in S \,\&\, Y = j\}] \\
&= \sum_{j=1}^m [\Pr\{X \in S \,|\, Y = j \,\&\, Z \in U\} \Pr\{Y = j \,|\, Z \in U\} - \Pr\{X \in S \,|\, Y = j\} \Pr\{Y = j\}] \\
&= \sum_{j=1}^m \Pr\{X \in S \,|\, Y = j\} [\Pr\{Y = j \,|\, Z \in U\} - \Pr\{Y = j\}] \\
&\leq \sum_{j=1}^m \Pr\{X \in S \,|\, Y = j\} [\Pr\{Y = j \,|\, Z \in U\} - \Pr\{Y = j\}]_+ \\
&\leq \sum_{j=1}^m [\Pr\{Y = j \,|\, Z \in U\} - \Pr\{Y = j\}]_+ \\
&\leq \max_{U \subseteq C} \sum_{j=1}^m [\Pr\{Y = j \,|\, Z \in U\} - \Pr\{Y = j\}]_+ \\
&= \phi(Y|Z).
\end{aligned}$$

Since the right side is independent of both S and U, the desired conclusion follows. ■

V. Inconsistency of an Estimator for Mixing Coefficients 

Suppose X, Y are real-valued random variables with some unknown joint distribution, and
suppose we are given an infinite sequence of independent samples $\{(x_i, y_i) : i = 1, 2, \ldots\}$. The
question studied in this section and the next is whether it is possible to construct empirical
estimates of the various mixing coefficients that converge to the true values as the number of
samples approaches infinity.

Let

$$\Phi_{X,Y}(a,b) = \Pr\{X \leq a \,\&\, Y \leq b\}$$

denote the true but unknown joint distribution function of X and Y, and let $\Phi_X(\cdot), \Phi_Y(\cdot)$
denote the true but unknown marginal distribution functions of X, Y respectively. Using the
samples $\{(x_i, y_i), i = 1, 2, \ldots\}$, we can construct three 'stair-case functions' that are empirical
estimates of $\Phi_X$, $\Phi_Y$ and $\Phi_{X,Y}$ based on the first l samples, as follows:

$$\hat{\Phi}_X(a; l) := \frac{1}{l} \sum_{i=1}^l I_{\{x_i \leq a\}}, \tag{29}$$

$$\hat{\Phi}_Y(b; l) := \frac{1}{l} \sum_{i=1}^l I_{\{y_i \leq b\}}, \tag{30}$$

$$\hat{\Phi}_{X,Y}(a, b; l) := \frac{1}{l} \sum_{i=1}^l I_{\{x_i \leq a \,\&\, y_i \leq b\}}, \tag{31}$$

where as usual I denotes the indicator function. Thus $\hat{\Phi}_X(a; l)$ counts the fraction of the first l
samples that are less than or equal to a, and so on. With this construction, a well-known result
called the Glivenko-Cantelli lemma (see [5], [2] or [8, p. 20]) states that the empirical estimates
converge uniformly and almost surely to their true functions as the number of samples $l \to \infty$.
Thus $\hat{\Phi}_{X,Y}$ is a consistent estimator of the true joint distribution. One might therefore be tempted to
think that an empirical estimate of any (or all) of the three mixing coefficients based on $\hat{\Phi}_{X,Y}$
will also converge to the true value as $l \to \infty$. The objective of this brief section is to show that
this is not so. Hence estimates of mixing coefficients derived from a consistent estimator of the
joint distribution need not themselves be consistent.

Theorem 8: Suppose $\hat{\Phi}_{X,Y}$ is defined by (31), and that $x_i \neq x_j$ and $y_i \neq y_j$ whenever $i \neq j$.
Let $\hat{\beta}_l$ denote the $\beta$-mixing coefficient associated with the joint distribution $\hat{\Phi}_{X,Y}(\cdot,\cdot; l)$. Then

$$\hat{\beta}_l = (l-1)/l.$$

Proof: Fix the integer l in what follows. Note that the empirical distribution $\hat{\Phi}_{X,Y}(\cdot,\cdot; l)$
depends only on the totality of the l samples, and not the order in which they are generated.
Without loss of generality, we can replace the samples $(x_1, \ldots, x_l)$ by their 'order statistics',
that is, the same samples arranged in increasing order, and do the same for the $y_i$. Thus the
assumption is that $x_1 < x_2 < \cdots < x_l$ and similarly $y_1 < y_2 < \cdots < y_l$. With this convention,
the empirical samples will be of the form $\{(x_1, y_{\pi(1)}), \ldots, (x_l, y_{\pi(l)})\}$ for some permutation $\pi$
of $\{1, \ldots, l\}$. Therefore the probability measure associated with the empirical distribution $\hat{\Phi}$ is
purely atomic, with jumps of magnitude $1/l$ at the points $\{(x_1, y_{\pi(1)}), \ldots, (x_l, y_{\pi(l)})\}$. So we can
simplify matters by replacing the real line on the X-axis by the finite set $\{x_1, \ldots, x_l\}$, and the
real line on the Y-axis by the finite set $\{y_1, \ldots, y_l\}$. With this redefinition, the joint distribution $\boldsymbol{\theta}$
assigns a weight of $1/l$ to each of the points $(x_i, y_{\pi(i)})$ and a weight of zero to all other points
$(x_i, y_j)$ whenever $j \neq \pi(i)$, while the marginal measures $\boldsymbol{\mu}, \boldsymbol{\nu}$ of X and Y will be uniform on
the respective finite sets. Thus the product measure $\boldsymbol{\mu} \times \boldsymbol{\nu}$ assigns a weight of $1/l^2$ to each of
the $l^2$ grid points $(x_i, y_j)$. From this, it is easy to see that

$$\hat{\beta}_l = \rho(\boldsymbol{\theta}, \boldsymbol{\mu} \times \boldsymbol{\nu}) = l \left( \frac{1}{l} - \frac{1}{l^2} \right) = (l-1)/l.$$
This is the desired conclusion. ■
Corollary 1: Suppose the true but unknown distribution $\Phi_{X,Y}$ has a density with respect to the
Lebesgue measure. Then $\hat{\beta}_l \to 1$, $\hat{\phi}_l \to 1$ almost surely as $l \to \infty$.

Proof: If the true distribution has a density, then it is nonatomic, which means that with
probability one, samples will be pairwise distinct. It now follows from Theorem 8 that

$$\hat{\phi}_l \geq \hat{\beta}_l = \frac{l-1}{l} \to 1 \text{ as } l \to \infty.$$

This is the desired conclusion. ■
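The conclusion of Theorem 8 can be reproduced in a few lines. The following Python sketch places one atom of mass $1/l$ at each sample, forms the product of the empirical marginals, and computes the resulting $\beta$-mixing coefficient. The function name, the seed and the use of two independently drawn lists of distinct integers as stand-ins for the samples are assumptions made purely for illustration.

```python
import random
from typing import List, Sequence


def empirical_beta_naive(xs: Sequence[float], ys: Sequence[float]) -> float:
    """beta-mixing coefficient of the raw empirical joint distribution.

    Assumes the x-values are pairwise distinct and likewise the y-values.
    """
    l = len(xs)
    x_rank = {x: r for r, x in enumerate(sorted(xs))}
    y_rank = {y: r for r, y in enumerate(sorted(ys))}
    theta: List[List[float]] = [[0.0] * l for _ in range(l)]
    for x, y in zip(xs, ys):
        theta[x_rank[x]][y_rank[y]] += 1.0 / l   # one atom per sample
    mu = [sum(row) for row in theta]
    nu = [sum(col) for col in zip(*theta)]
    return 0.5 * sum(abs(theta[i][j] - mu[i] * nu[j])
                     for i in range(l) for j in range(l))


rng = random.Random(0)
l = 40
# Two lists drawn independently of each other; sampling without
# replacement guarantees distinct values within each list.
xs = rng.sample(range(10**6), l)
ys = rng.sample(range(10**6), l)
estimate = empirical_beta_naive(xs, ys)
print(estimate, (l - 1) / l)
```

Although the two lists are generated independently of each other, the estimate equals $(l-1)/l$ up to rounding, exactly as the theorem predicts, regardless of how the samples are paired.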

VI. Consistent Estimators for Mixing Coefficients 

The objective of the present section is to show that a simple modification of the 'naive' 
algorithm proposed in Section V does indeed lead to consistent estimates, provided appropriate 
technical conditions are satisfied. 

The basic idea behind the estimators is quite simple. Suppose that one is given samples
$\{(x_i, y_i), i \geq 1\}$ generated independently and at random from an unknown joint probability
measure $\theta \in \mathcal{M}(\mathbb{R}^2)$. Given l samples, choose an integer $k_l$ of bins. Divide the real line into $k_l$
intervals such that each bin contains $\lfloor l/k_l \rfloor$ or $\lfloor l/k_l \rfloor + 1$ samples for both X and Y. In other
words, carry out percentile binning of both random variables. One way to do this (but the proof
is not dependent on how precisely this is done) is as follows: Define $m_l = \lfloor l/k_l \rfloor$, $r = l - k_l m_l$,
and place $m_l + 1$ samples in the first r bins and $m_l$ samples in the next $k_l - r$ bins. This gives
a way of discretizing the real line for both X and Y such that the discretized random variables
have nearly uniform marginals. With this binning, compute the corresponding joint distribution,
and the associated empirical estimates of the mixing coefficients. The various theorems below
show that, subject to some regularity conditions, the empirical estimates produced by this scheme
do indeed converge to their right values with probability one as $l \to \infty$, provided that $m_l \to \infty$,
or equivalently, $k_l/l \to 0$, as $l \to \infty$. In other words, in order for this theorem to apply, the
number of bins must increase more slowly than the number of samples, so that the number of
samples per bin must approach infinity. In contrast, in Theorem 8, we have effectively chosen
$k_l = l$ so that each bin contains precisely one sample, which explains why that approximation
scheme does not work.
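The binning recipe described above can be sketched as follows in Python. The function names, the seed and the sample sizes are illustrative assumptions; the sketch assigns samples to bins by rank, forms the binned joint distribution, and evaluates only the $\beta$-mixing coefficient, since that one has a closed form.

```python
import random
from typing import List, Sequence


def percentile_bins(values: Sequence[float], k: int) -> List[int]:
    """Bin label of each sample: the first r bins receive m + 1 samples and
    the remaining k - r bins receive m samples, where m, r = divmod(l, k)."""
    l = len(values)
    m, r = divmod(l, k)
    order = sorted(range(l), key=lambda i: values[i])  # indices by rank
    labels = [0] * l
    pos = 0
    for b in range(k):
        size = m + 1 if b < r else m
        for idx in order[pos:pos + size]:
            labels[idx] = b
        pos += size
    return labels


def binned_beta_estimate(xs: Sequence[float], ys: Sequence[float], k: int) -> float:
    """Empirical beta-mixing coefficient after percentile binning into k bins."""
    l = len(xs)
    bx, by = percentile_bins(xs, k), percentile_bins(ys, k)
    theta = [[0.0] * k for _ in range(k)]
    for i, j in zip(bx, by):
        theta[i][j] += 1.0 / l
    mu = [sum(row) for row in theta]
    nu = [sum(col) for col in zip(*theta)]
    return 0.5 * sum(abs(theta[i][j] - mu[i] * nu[j])
                     for i in range(k) for j in range(k))


rng = random.Random(1)
l, k = 2000, 5                        # k much smaller than l
xs = [rng.random() for _ in range(l)]
ys = [rng.random() for _ in range(l)]  # generated independently of xs
independent_estimate = binned_beta_estimate(xs, ys, k)
dependent_estimate = binned_beta_estimate(xs, xs, k)  # Y identical to X
print(independent_estimate, dependent_estimate)
```

With many samples per bin, the estimate for the independently generated pair is small, in contrast with the value $(l-1)/l$ obtained when every sample sits in its own bin, while the estimate for the fully dependent pair equals $(k-1)/k$ for this choice of l and k.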

To state the various theorems, we introduce a little bit of notation, and refer the reader to [1] 
for all concepts from measure theory that are not explicitly defined here. Let Ai(M), M.(M. 2 ) 
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denote the set of all measures on R or R 2 equipped with the Borel a-algebra. Recall that if 
6,i] e M(M) or M(R 2 ), then 9 is said to be absolutely continuous with respect to r), denoted 
by 9 <C i], if for every measurable set E, r)(E) = =>- p{E) = 0. 

Next, let $\theta$ denote the joint probability measure of $(X, Y)$, and let $\mu, \nu$ denote the marginal
measures. Thus, for every measurable$^2$ subset $S \subseteq \mathbb{R}$, the measure $\mu(S)$ is defined as $\theta(S \times \mathbb{R})$
and similarly for all $T \subseteq \mathbb{R}$, the measure $\nu(T)$ is defined as $\theta(\mathbb{R} \times T)$. Now the key assumption
made here is that the joint measure $\theta$ is absolutely continuous with respect to the product measure
$\mu \times \nu$. In the case of discrete random variables, this assumption is automatically satisfied. Suppose
that for some pair of indices $i, j$, it is the case that $\mu_i \cdot \nu_j = 0$. Then either $\mu_i = 0$ or $\nu_j = 0$. If
$\mu_i = 0$, then it follows from the identity $\sum_{j'} \theta_{ij'} = \mu_i$ that $\theta_{ij'} = 0$ for all $j'$, and in particular
$\theta_{ij} = 0$. Similarly if $\nu_j = 0$, then it follows from the identity $\sum_{i'} \theta_{i'j} = \nu_j$ that $\theta_{i'j} = 0$ for all $i'$,
and in particular $\theta_{ij} = 0$. In either case it follows that $\theta_{ij} = 0$, so that $\theta \ll \mu \times \nu$. However, in
the case of real random variables, this need not be so. For example, replace $\mathbb{R} \times \mathbb{R}$ by the unit
square, and let $\theta$ be the diagonal measure. Then both marginals $\mu, \nu$ are the uniform measures
on the unit interval, and the product $\mu \times \nu$ is the uniform measure on the unit square, and $\theta$
is singular with respect to the uniform measure.
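
A small numerical sketch may help to make the contrast concrete. It assumes a joint distribution given as a list of rows, and the helper names `discrete_density` and `diagonal_joint` are hypothetical. The first function checks the discrete argument above and returns the ratio $\theta_{ij}/(\mu_i \nu_j)$; the second discretizes the diagonal measure on a $k \times k$ grid of equal-width cells. On that grid the largest ratio equals $k$, so it grows without bound as the grid is refined, which is consistent with the diagonal measure having no density with respect to $\mu \times \nu$.

```python
from fractions import Fraction


def discrete_density(theta):
    """Return f[i][j] = theta[i][j] / (mu[i] * nu[j]) for a joint pmf given as rows."""
    mu = [sum(row) for row in theta]          # marginal of X (row sums)
    nu = [sum(col) for col in zip(*theta)]    # marginal of Y (column sums)
    f = []
    for i, row in enumerate(theta):
        f_row = []
        for j, t in enumerate(row):
            prod = mu[i] * nu[j]
            if prod == 0:
                # A zero marginal forces every entry in that row or column to vanish.
                assert t == 0
                f_row.append(Fraction(0))     # the value on a null cell is arbitrary
            else:
                f_row.append(Fraction(t) / prod)
        f.append(f_row)
    return f


def diagonal_joint(k):
    """Diagonal measure on the unit square, discretized on a k x k equal-width grid."""
    return [[Fraction(1, k) if i == j else Fraction(0) for j in range(k)]
            for i in range(k)]


for k in (2, 8, 32):
    peak = max(max(row) for row in discrete_density(diagonal_joint(k)))
    print(k, peak)  # the largest ratio equals k
```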

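Next we introduce symbols for the various densities. Since $\theta \ll \mu \times \nu$, it follows that $\theta$ has
a Radon-Nikodym derivative with respect to $\mu \times \nu$, which is denoted by $f(\cdot,\cdot)$. So for any sets
$S, T \subseteq \mathbb{R}$, it follows that

\[ \theta(S \times T) = \int_S \int_T f(x,y) \, d\nu(y) \, d\mu(x) = \int_T \int_S f(x,y) \, d\mu(x) \, d\nu(y). \]

For any $T \subseteq \mathbb{R}$ with $\nu(T) > 0$, the conditional probability $\Pr\{X \in S | Y \in T\}$ is given by

\[ \Pr\{X \in S | Y \in T\} = \frac{\Pr\{X \in S \,\&\, Y \in T\}}{\Pr\{Y \in T\}} = \frac{\theta(S \times T)}{\nu(T)} = \int_S \left[ \int_T \frac{f(x,y)}{\nu(T)} \, d\nu(y) \right] d\mu(x). \]
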



Theorem 9: With the above notation and conditions, the empirically estimated $\beta$-mixing coefficient
$\hat{\beta}_l$ converges almost surely to the true value $\beta$ as $l \to \infty$, provided that $k_l \to \infty$ and
$k_l/l \to 0$ as $l \to \infty$.

$^2$ Hereafter we drop this adjective; it is assumed that all sets that are encountered are measurable.

Theorem 10: Suppose that the density $f(\cdot,\cdot)$ belongs to $L_\infty(\mathbb{R}^2)$, and that $k_l/l \to 0$ as $l \to \infty$.
Then the empirically estimated $\alpha$-mixing coefficient $\hat{\alpha}_l$ converges almost surely to the true value
$\alpha$ as $l \to \infty$, and the empirically estimated $\phi$-mixing coefficient $\hat{\phi}_l$ converges almost surely to
the true value $\phi$ as $l \to \infty$.

Note that the density $f(\cdot,\cdot) \in L_1(\mathbb{R}^2, \mu \times \nu)$. So the sequence of empirical estimates $\hat{\beta}_l$
converges to its true value without any additional technical conditions. The sequences of empirical
estimates $\hat{\alpha}_l$ and $\hat{\phi}_l$ converge to their true values provided the density $f$ is bounded almost
everywhere. This condition is intended to ensure that conditional densities do not 'blow up'.
In the case of discrete variables, we have already seen that the condition $\theta \ll \mu \times \nu$ holds
automatically, which means that the 'density' $f_{ij} = \theta_{ij}/(\mu_i \nu_j)$ is always well-defined. Since there
are only finitely many values of $i$ and $j$, this ratio is also bounded. However, in the case of
real-valued random variables, this condition needs to be imposed explicitly.

The proofs of these two theorems are based on arguments in [9], [14]. In the proof of Theorem 
9, we can use those arguments as they are, whereas in the proof of Theorem 10, we need to 
adapt them. To facilitate the discussion, we first reprise the relevant results from [9], [14]. 

Definition 4: Let $(\Omega, \mathcal{F})$ be a measurable space, and let $Q$ be a probability measure on $(\Omega, \mathcal{F})$.
Suppose $\{I_1, \ldots, I_L\}$ is a finite partition of $\Omega$, and that $\{I_1^{(m)}, \ldots, I_L^{(m)}\}$ is a sequence of partitions
of $\Omega$. Then $\{I_1^{(m)}, \ldots, I_L^{(m)}\}$ is said to converge to $\{I_1, \ldots, I_L\}$ with respect to $Q$ if, for every
probability measure $P$ on $(\Omega, \mathcal{F})$ such that $P \ll Q$, it is the case that

\[ P(I_i^{(m)}) \to P(I_i) \quad \text{as } m \to \infty. \]

See [14, Definition 1].

Theorem 11: Suppose $Q$ is a probability measure on $(\mathbb{R}, \mathcal{B})$ that is absolutely continuous with
respect to the Lebesgue measure, $L$ is a fixed integer, and that $\{I_1, \ldots, I_L\}$ is an equiprobable
partitioning of $\mathbb{R}$. In other words, choose numbers

\[ -\infty = a_0 < a_1 < \cdots < a_{L-1} < a_L = +\infty \]

such that the semi-open intervals $I_i = (a_{i-1}, a_i]$ satisfy

\[ Q(I_i) = 1/L, \quad i = 1, \ldots, L. \]

Suppose $\{y_1, \ldots, y_m\}$ are i.i.d. samples generated in accordance with $Q$, and that $m = l_m L$
with $l_m \in \mathbb{N}$, an integer. Let $\{I_1^{(m)}, \ldots, I_L^{(m)}\}$ denote the empirical equiprobable partitioning
associated with the samples $\{y_1, \ldots, y_m\}$. Then $\{I_1^{(m)}, \ldots, I_L^{(m)}\}$ converges to $\{I_1, \ldots, I_L\}$ with
respect to $Q$ as $m \to \infty$.
Proof: See [14, Lemma 1].
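
As a rough numerical illustration of Theorem 11 (not a substitute for the proof in [14]), the following Python sketch builds the empirical equiprobable partition from order statistics and prints its interior boundaries for increasing $m$. The choice of $Q$ as the unit-rate exponential distribution, the function name `empirical_boundaries`, and the convention that each cell's right endpoint is its largest sample are all assumptions made for the example. The sketch only shows the boundaries approaching the true quantiles of $Q$, which is an informal proxy for the convergence in Definition 4.

```python
import random


def empirical_boundaries(samples, L):
    """Interior boundaries a_1 < ... < a_{L-1} of the empirical equiprobable partition.

    Assumes len(samples) is a multiple of L, so that every cell holds exactly
    len(samples) // L points.
    """
    m = len(samples)
    if m % L != 0:
        raise ValueError("number of samples must be a multiple of L")
    per_cell = m // L
    ordered = sorted(samples)
    # Cell i is (a_{i-1}, a_i], so its right endpoint is the largest sample it contains.
    return [ordered[i * per_cell - 1] for i in range(1, L)]


random.seed(0)
L = 4
for m in (40, 4000, 100000):
    samples = [random.expovariate(1.0) for _ in range(m)]
    print(m, [round(a, 3) for a in empirical_boundaries(samples, L)])
```

For the unit-rate exponential distribution the true boundaries are $-\log(1 - i/4)$, $i = 1, 2, 3$, so the printed values can be compared against these quartiles.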

Theorem 12: Let $(\Omega, \mathcal{F})$ be a measurable space, and let $Q$ be a probability measure on $(\Omega, \mathcal{F})$.
Suppose $\{I_1^{(m)}, \ldots, I_L^{(m)}\}$ is a sequence of partitions of $\Omega$ that converges with respect to $Q$ to
another partition $\{I_1, \ldots, I_L\}$ as $m \to \infty$. Suppose $\{x_1, \ldots, x_n\}$ are i.i.d. samples generated in
accordance with a probability measure $P \ll Q$, and let $P_n$ denote the empirical measure generated by
these samples. Then

\[ \lim_{m \to \infty} \lim_{n \to \infty} P_n(I_i^{(m)}) = P(I_i), \quad \text{a.s.} \ \forall i. \]

Proof: See [14, Lemma 2].

Before proceeding to the proofs of the two theorems, we express the three mixing coefficients
in terms of the densities. As stated in (4), we have that

\[ \beta(X,Y) = 0.5 \int_{\mathbb{R}} \int_{\mathbb{R}} |f(x,y) - 1| \, d\mu(x) \, d\nu(y). \tag{32} \]

Here we take advantage of the fact that the 'density' of $\mu$ with respect to itself is one, and
similarly for $\nu$. Next, as in Theorem 3, we can drop the absolute value signs in the definitions of
$\alpha(X, Y)$ and of $\phi(X|Y)$. Therefore the various mixing coefficients can be expressed as follows:

\[ \alpha(X,Y) = \sup_T \sup_S \int_S \int_T [f(x,y) - 1] \, d\nu(y) \, d\mu(x), \tag{33} \]

\[ \phi(X|Y) = \sup_T \sup_S \int_S \left[ \int_T \frac{f(x,y)}{\nu(T)} \, d\nu(y) - 1 \right] d\mu(x). \tag{34} \]

Now, for each fixed set $T$, let us define signed measures $\kappa_T$ and $\delta_T$ as follows:

\[ \kappa_T(x) = \int_T [f(x,y) - 1] \, d\nu(y), \qquad \delta_T(x) = \int_T \frac{f(x,y)}{\nu(T)} \, d\nu(y) - 1, \]

and associated support sets

\[ A_+(T) = \{x \in \mathbb{R} : \kappa_T(x) > 0\}, \qquad B_+(T) = \{x \in \mathbb{R} : \delta_T(x) > 0\}. \]

Then it is easy to see that, for each fixed set $T$, the supremum in (33) is achieved by the choice
$S = A_+(T)$ while the supremum in (34) is achieved by the choice $S = B_+(T)$. Therefore

\[ \alpha(X,Y) = \sup_T \int_{A_+(T)} \kappa_T(x) \, d\mu(x) = \sup_T \int_{\mathbb{R}} [\kappa_T(x)]_+ \, d\mu(x), \tag{35} \]

\[ \phi(X|Y) = \sup_T \int_{B_+(T)} \delta_T(x) \, d\mu(x) = \sup_T \int_{\mathbb{R}} [\delta_T(x)]_+ \, d\mu(x). \tag{36} \]

These formulas are the continuous analogs of (20) and (21) respectively.
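
For a finite joint distribution, the expressions (32), (35) and (36) reduce to sums and can be evaluated by exhaustive search over the subsets $T$. The following Python sketch does exactly that. It is meant only as an illustration under stated assumptions: the function name `mixing_coefficients` is hypothetical, the joint distribution is a list of rows `theta[i][j]`, and the search over $T$ is exponential in the number of columns, so it is practical only for small alphabets. For each $T$ the inner supremum over $S$ is replaced by the positive part, as in (35) and (36).

```python
from itertools import combinations


def mixing_coefficients(theta):
    """Return (alpha, beta, phi) for a joint pmf theta[i][j] = Pr{X = i, Y = j}.

    Here phi denotes phi(X|Y). Subsets T with nu(T) = 0 are skipped.
    """
    n_rows, n_cols = len(theta), len(theta[0])
    mu = [sum(row) for row in theta]
    nu = [sum(theta[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Discrete form of (32): half the l1 distance between theta and mu x nu.
    beta = 0.5 * sum(abs(theta[i][j] - mu[i] * nu[j])
                     for i in range(n_rows) for j in range(n_cols))

    alpha = phi = 0.0
    for size in range(1, n_cols + 1):
        for T in combinations(range(n_cols), size):
            nu_T = sum(nu[j] for j in T)
            if nu_T == 0:
                continue
            # mu_i * kappa_T(i) = theta(i, T) - mu_i * nu(T); keep the positive part.
            weighted = [sum(theta[i][j] for j in T) - mu[i] * nu_T
                        for i in range(n_rows)]
            positive_part = sum(v for v in weighted if v > 0)
            alpha = max(alpha, positive_part)          # discrete form of (35)
            phi = max(phi, positive_part / nu_T)       # discrete form of (36)
    return alpha, beta, phi


print(mixing_coefficients([[0.25, 0.25], [0.25, 0.25]]))  # independent variables
print(mixing_coefficients([[0.5, 0.0], [0.0, 0.5]]))      # X = Y with uniform marginals
```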
Proof of Theorem 9: For a fixed integer $L \geq 2$, choose real numbers

\[ -\infty = a_0 < a_1 < \cdots < a_{L-1} < a_L = +\infty, \]

\[ -\infty = b_0 < b_1 < \cdots < b_{L-1} < b_L = +\infty \]

such that the semi-open intervals $I_i = (a_{i-1}, a_i]$, $J_i = (b_{i-1}, b_i]$ satisfy

\[ \mu(I_i) = 1/L, \quad \nu(J_i) = 1/L, \quad i = 1, \ldots, L. \]

Now define the equiprobable partition of $\mathbb{R}^2$ consisting of the $L \times L$ grid $\{I_i \times J_j, i,j = 1, \ldots, L\}$.
Next, based on the $l$-length empirical sample $\{(x_1, y_1), \ldots, (x_l, y_l)\}$, construct empirical marginal
distributions $\hat{\mu}$ for $X$ and $\hat{\nu}$ for $Y$. Based on these empirical marginals, divide both the $X$-axis
$\mathbb{R}$ and $Y$-axis $\mathbb{R}$ into $L$ bins each having nearly equal fractions of the $l$ samples in each bin. This
gives an empirical $L \times L$ partitioning of $\mathbb{R}^2$, which is denoted by $\{I_i^{(L)} \times J_j^{(L)}, i,j = 1, \ldots, L\}$.
Using this grid, compute the associated empirical joint distribution $\hat{\theta}_l$ on $\mathbb{R}^2$. Then the proof of
[14, Lemma 1] can be adapted to show that the empirical partition $\{I_i^{(L)} \times J_j^{(L)}, i,j = 1, \ldots, L\}$
converges to the true partition $\{I_i \times J_j, i,j = 1, \ldots, L\}$ as $l \to \infty$, with respect to the product
measure $\mu \times \nu$. The only detail that differs from [14] is the computation of the so-called 'growth
function'. Given a set $A \subseteq \mathbb{R}^2$ of cardinality $m$, the number of different ways in which this
set can be partitioned by a rectangular grid of dimension $L \times L$ is called the growth function,
denoted by $\Delta_m$. It is shown in [14, Eq. (15)] that when the partition consists of $L$ intervals and
the set being partitioned is $\mathbb{R}$, then $\Delta_m$ is given by the combinatorial parameter

\[ \binom{m+L}{L} = \frac{(m+L)!}{m! \, L!}. \]

It is also shown in [14, Eq. (21)] that

\[ \log \binom{m+L}{L} \leq 2 m h(1/L), \]

where $h(\cdot)$ is defined by

\[ h(x) = -x \log x - (1 - x) \log(1 - x), \quad \forall x \in (0, 1). \]

When $\mathbb{R}$ is replaced by $\mathbb{R}^2$ and a set of $L$ intervals is replaced by a grid of $L^2$ rectangles, it is
easy to see that the growth function is no larger than

\[ \Delta_m \leq \binom{m+L}{L}^2. \]

Therefore

\[ \log \Delta_m \leq 4 m h(1/L). \]

In any case, since $L$, the number of grid elements, approaches $\infty$ as $l \to \infty$, it follows that the
growth condition proposed in [9] is satisfied. Therefore the empirical partition converges to the
true partition as $l \to \infty$.

Next, let $\{I_i \times J_j, i,j = 1, \ldots, L\}$ denote, as before, the true equiprobable $L \times L$ gridding of
$\mathbb{R}^2$. Suppose that, after $l$ samples $(x_r, y_r), r = 1, \ldots, l$ have been drawn, the data is put into $k_l$
bins. Then the expression (32) defining the true $\beta$-mixing coefficient can be rewritten as

\[ \beta(X,Y) = 0.5 \sum_{i=1}^{k_l} \sum_{j=1}^{k_l} \int_{I_i} \int_{J_j} |f(x,y) - 1| \, d\mu(x) \, d\nu(y). \]

Now suppose $l$ is an exact multiple of $k_l$. Then the empirical estimate based on the $k_l \times k_l$
empirical grid can be written as

\[ \hat{\beta}_l = 0.5 \sum_{i=1}^{k_l} \sum_{j=1}^{k_l} \left| \frac{C_{ij}}{l} - \frac{1}{k_l^2} \right|, \]

where $C_{ij}$ denotes the number of samples $(x_r, y_r)$ in the $ij$-th cell of the empirical (not true)
equiprobable grid. If $l$ is not an exact multiple of $k_l$, then some bins will have $\lfloor l/k_l \rfloor$ elements
while other bins will have $\lfloor l/k_l \rfloor + 1$ elements. As a result, the term $k_l^{-2}$ gets replaced by
$(s_i t_j)/l^2$ where $s_i$ is the number of samples in $I_i^{(l)}$ and $t_j$ is the number of samples in $J_j^{(l)}$.
Now, just as in [14, Eq. (36) et seq.], the error $|\hat{\beta}_l - \beta(X,Y)|$ can be bounded by the sum of
two errors, the first of which is caused by the fact that the empirical equiprobable grid is not
the same as the true equiprobable grid (the term $e_1$ of [14]), and the second is the error caused
by approximating an integral by a finite sum over the true equiprobable grid (the term $e_2$ of
[14]). Out of these, the first error term goes to zero as $l \to \infty$ because, if $k_l/l \to 0$ so that each
bin contains increasingly many samples, the empirical equiprobable grid converges to the true
equiprobable grid. The second error term goes to zero because the integrand in (32) belongs to
$L_1(\mathbb{R}^2, \mu \times \nu)$, as shown in [14, Eq. (37)]. ■
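
The empirical estimate used in this proof can be sketched in Python as follows. The sketch is illustrative only: the function names `rank_bins` and `beta_estimate` are hypothetical, ties are broken by position in the input, and the general form with $(s_i t_j)/l^2$ in place of $k_l^{-2}$ is used so that $l$ need not be a multiple of the number of bins. The two calls at the end contrast a grid with many samples per bin against the choice of one sample per bin, using independent Gaussian samples chosen purely for the example.

```python
import random


def rank_bins(values, k):
    """Percentile binning: bin sizes differ by at most one, larger bins first."""
    l = len(values)
    m, r = divmod(l, k)
    order = sorted(range(l), key=lambda idx: values[idx])
    labels, pos = [0] * l, 0
    for b in range(k):
        size = m + 1 if b < r else m
        for idx in order[pos:pos + size]:
            labels[idx] = b
        pos += size
    return labels


def beta_estimate(xs, ys, k):
    """Empirical beta-mixing estimate on a k x k percentile grid."""
    l = len(xs)
    bx, by = rank_bins(xs, k), rank_bins(ys, k)
    counts = [[0] * k for _ in range(k)]           # C_ij: samples in cell (i, j)
    for i, j in zip(bx, by):
        counts[i][j] += 1
    s = [sum(row) for row in counts]               # s_i: samples in the i-th X-bin
    t = [sum(col) for col in zip(*counts)]         # t_j: samples in the j-th Y-bin
    return 0.5 * sum(abs(counts[i][j] / l - s[i] * t[j] / l ** 2)
                     for i in range(k) for j in range(k))


random.seed(0)
l = 8000
xs = [random.gauss(0, 1) for _ in range(l)]
ys = [random.gauss(0, 1) for _ in range(l)]        # independent of xs, so beta = 0
print(beta_estimate(xs, ys, 5))                    # many samples per bin
print(beta_estimate(xs[:200], ys[:200], 200))      # one sample per bin
```

With one sample per bin and distinct sample values, the count matrix is a permutation matrix, so the estimate equals $1 - 1/l$ whatever the true dependence is. This matches the earlier remark that choosing $k_l = l$ does not work.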
Proof of Theorem 10: The main source of difficulty here is that, whereas the expression for
$\beta(X, Y)$ involves just a single integral, the expressions for $\alpha(X, Y)$ and for $\phi(X|Y)$ involve the
supremum over all sets $T \subseteq \mathbb{R}$. Thus, in order to show that the empirical estimates converge
to the true values, we must show not only that empirical estimates of integrals of the form
$\int_{\mathbb{R}} [\kappa_T]_+ \, d\mu(x)$ and $\int_{\mathbb{R}} [\delta_T]_+ \, d\mu(x)$ converge to their correct values for each fixed set $T$, but also
that the convergence is in some sense uniform with respect to $T$. This is where we use the
boundedness of the density $f(\cdot,\cdot)$. The details are fairly routine modifications of arguments in
[14]. Specifically (switching notation to that of [14]), suppose that in their Equation (27), we
have not just one measure $\mu$, but rather a family of measures $\mu_T$, indexed by $T$, and suppose
there exists a finite constant $c$ such that for every set $S$ we have $\mu_T(S) \leq c Q(S)$. Then it follows
from Equation (27) et seq. of [14] that

\[ \mu_T((a_i \wedge a_i^m, a_i \vee a_i^m]) \leq c \, Q((a_i \wedge a_i^m, a_i \vee a_i^m]), \quad \forall T. \]

Therefore

\[ \lim_{m \to \infty} \sup_T \mu_T((a_i \wedge a_i^m, a_i \vee a_i^m]) = 0. \]

With this modification, the rest of the proof in [14] can be mimicked to show the following: In
the interests of brevity, define

\[ r_T = \int_{\mathbb{R}} [\kappa_T]_+ \, d\mu(x) \]

and let $\hat{r}_{T,l}$ denote its empirical approximation. Then, using the above modification of the
argument in [14], it follows that

\[ \lim_{l \to \infty} \sup_T |r_T - \hat{r}_{T,l}| = 0. \]

As a consequence,

\[ \lim_{l \to \infty} \sup_T \hat{r}_{T,l} = \sup_T r_T = \alpha(X, Y). \]

The proof for the $\phi$-mixing coefficient is entirely similar. ■

VII. Concluding Remarks 

In this paper we have studied the problem of estimating the mixing coefficients between two
random variables. Three different mixing coefficients were studied, namely $\alpha$-mixing, $\beta$-mixing
and $\phi$-mixing coefficients. The random variables can either assume values in a finite set or
the set of real numbers. We derived upper and lower bounds for both the $\alpha$-mixing and the
$\phi$-mixing coefficients. Moreover, in case the marginal distributions of the two random variables
are uniform, an exact expression was given for the $\phi$-mixing coefficient. This situation arises
when empirically generated samples are binned using percentile binning. We also proved analogs
of the data-processing inequality from information theory for each of the three kinds of mixing
coefficients. Then we moved on to real-valued random variables, and showed that, even though
the empirically estimated joint distribution converges to the true joint distribution, estimates of
the $\beta$- or $\phi$-mixing coefficients based on the empirical joint distribution converge to 1 under mild
conditions. However, by using percentile binning and allowing the number of bins to increase
more slowly than the number of samples, we can generate empirical estimates that converge to
the true values.

In general both the $\alpha$- and the $\phi$-mixing coefficients are solutions of integer programming
problems, as shown in (17) and (19) respectively. So it would be interesting to explore whether
the computation of these mixing coefficients is NP-hard. Also, it is clear from these same
two equations that it is possible to construct a convex programming relaxation of (17) and a
nonlinear programming relaxation of (19). It would be interesting to analyze how close, if at
all, the solutions of these relaxed problems are to those of the original integer programming
problems.

The later parts of the paper [14] contain some proposals on how to speed up the convergence 
of the empirical estimates of the Kullback-Leibler divergence between two unknown measures. 
It might be worthwhile to explore whether similar speed-ups can be found for the algorithms 
proposed here for estimating mixing coefficients from empirical data. 